
Conversation


@ngxson ngxson commented Oct 24, 2025

Seems like the "OCR model race" has started. This looks like one of the few "low-hanging fruits" that we can easily support in llama.cpp

The model features:

  • Qwen3 as language model
  • Mistral3 as vision encoder (the difference being that LightOnOCR does not use the [IMG_BREAK] token)

Original model: https://huggingface.co/lightonai/LightOnOCR-1B-1025

GGUF model: https://huggingface.co/ggml-org/LightOnOCR-1B-1025-GGUF

To try it:

llama-server -hf ggml-org/LightOnOCR-1B-1025-GGUF -c 8192

# open http://localhost:8080 and try uploading an image

Important note: this model requires a specific input structure; see the chat template

The structure seems to be:

  • Starts with an empty system message
  • Then, a user message. All images must be contained in this message; no instructions are needed

Example:

{
  "messages": [{
    "role": "system",
    "content": ""
  }, {
    "role": "user",
    "content": [{
      "type": "image_url",
      "image_url": {"url": "data:image/png;base64,......"}
    }]
  }]
}
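
A minimal Python sketch of a request following this structure, assuming the requests library, a local page.png, and the default address of a llama-server instance started with the command above (the server exposes an OpenAI-compatible /v1/chat/completions endpoint):

```python
# Illustrative sketch: send one image to a local llama-server for OCR.
# Assumes: server started as shown above, "page.png" exists, `requests` is installed.
import base64
import requests

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {"role": "system", "content": ""},      # empty system message
        {"role": "user", "content": [           # image-only user message, no instructions
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])  # recognized text
```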

@ngxson ngxson requested a review from CISC as a code owner October 24, 2025 23:15
@github-actions github-actions bot added examples python python script changes labels Oct 24, 2025

@ggerganov ggerganov left a comment


Very cool!

(image attachment)

The command in OP should be llama-server instead of llama-cli.

iosub added a commit to iosub/ollama that referenced this pull request Oct 26, 2025
This commit introduces comprehensive support for Qwen3-VL vision-language
models, including both the dense variant and the Mixture-of-Experts (MoE)
architecture with DeepStack fusion capabilities.

## Overview

Qwen3-VL is Alibaba's family of advanced multimodal models capable of
understanding and reasoning about images alongside text. This implementation
enables running these models for various vision-language tasks including
image understanding, optical character recognition (OCR), visual question
answering, and document analysis.

## Architecture Implementation

### Core Architecture (llama-arch.cpp/h)
- **LLM_ARCH_QWEN3_VL**: Dense vision-language model architecture
- **LLM_ARCH_QWEN3_VL_MOE**: Mixture-of-Experts variant with expert routing
- Complete tensor mapping registration for both architectures
- Architecture-specific parameter handling and validation

### Model Loading (llama-model.cpp)

**Hyperparameter Loading**
- QWEN3_VL: Standard dense model configuration
  * Uses full n_embd dimension throughout
  * 36 layers for 4B parameter variant
- QWEN3_VL_MOE: Expert-based configuration
  * 4x n_embd expansion (n_embd/4 per channel × 4 channels)
  * 48 layers (30B-A3B) or 94 layers (235B-A22B)
  * Expert feed-forward network dimensions

**Multi-axis Rotary Position Embedding (M-RoPE)**
- Configured rope_sections = [24, 20, 20, 0] (see the sketch after this list)
  * Temporal dimension: 24 dims
  * Height dimension: 20 dims
  * Width dimension: 20 dims
  * Unused dimension: 0
- Enables spatial awareness for image patch processing
- Added debug logging for MRoPE configuration verification
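
As a rough illustration of the section layout (not the llama.cpp implementation), each rotary section simply reuses the position index of its own axis; a minimal Python sketch, assuming one position id per rotary slot:

```python
# Illustrative only: expand a (temporal, height, width) patch position into
# per-section rotary position ids for rope_sections = [24, 20, 20, 0].
rope_sections = [24, 20, 20, 0]   # temporal, height, width, unused

def mrope_position_ids(t: int, h: int, w: int) -> list[int]:
    pos_per_axis = [t, h, w, 0]
    ids = []
    for n_dims, pos in zip(rope_sections, pos_per_axis):
        ids.extend([pos] * n_dims)  # each section repeats its axis position
    return ids

ids = mrope_position_ids(0, 3, 7)            # patch at grid row 3, column 7
print(len(ids))                              # 64 = 24 + 20 + 20 + 0
print(ids[:2], ids[24:26], ids[44:46])       # [0, 0] [3, 3] [7, 7]
```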

**Tensor Initialization**
- QWEN3_VL follows QWEN3 dense structure
  * Token embeddings, output projection
  * Per-layer: attention (Q/K/V/O), normalization, FFN
- QWEN3_VL_MOE includes expert-specific tensors
  * Expert gate networks for routing
  * Per-expert FFN weights (gate, down, up)
  * Shared and expert-specific parameters

### Graph Building (llama-graph.cpp/h)

**DeepStack Architecture for MoE**
The Qwen3-VL-MoE variant implements a novel DeepStack fusion mechanism (a conceptual sketch follows the list below):

1. **Channel Splitting**: Vision embeddings split into 3 processing channels
   - ds0, ds1, ds2 (DeepStack channels 0, 1, 2)
   - Each channel: n_embd/4 dimensions

2. **Per-layer Processing**: Independent expert selection per channel
   - Token-level expert routing
   - Gated mixture-of-experts computation
   - Q/K normalization before attention

3. **Fusion Layers**: Learned merging at early transformer layers
   - Fusion occurs at layers 0, 1, and 2
   - DeepStack merger combines information across channels
   - Only active when vision embeddings present (text-only safe)
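
A loose conceptual sketch of this split-and-fuse flow (function names are hypothetical, not llama.cpp symbols; feature dimensions assumed divisible by the channel count):

```python
# Conceptual sketch only: DeepStack channel splitting and early-layer fusion.
import numpy as np

def split_deepstack_channels(vision_embd: np.ndarray) -> list[np.ndarray]:
    """Split projected vision embeddings into ds0, ds1, ds2 along the feature axis."""
    return np.split(vision_embd, 3, axis=-1)

def maybe_fuse(hidden: np.ndarray, channels, layer_idx: int, merger):
    """Add the merged DeepStack channel at layers 0-2; a no-op for text-only batches."""
    if channels is None or layer_idx not in (0, 1, 2):
        return hidden                               # text-only safe
    return hidden + merger(channels[layer_idx])     # learned merger (see sketch further below)
```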

**Batch Processing**
- Enhanced position array handling for M-RoPE multi-dimensional positions
- Proper ubatch preparation distinguishing vision vs text tokens
- Conditional graph construction based on modality

### Vision Processing (clip.cpp/clip-impl.h)

**PROJECTOR_TYPE_QWEN3VLMOE**
- New projector type for Qwen3-VL-MoE vision encoder
- Handles projection from vision encoder to language model space

**DeepStack Merger Implementation**
The merger is a learnable 2-layer MLP with normalization:
```
Input (3 channels)
  → LayerNorm(norm_w, norm_b)
  → Linear(fc1_w, fc1_b)
  → GELU activation
  → Linear(fc2_w, fc2_b)
  → Output (fused representation)
```

Components (referenced in the sketch below):
- `norm_w`, `norm_b`: Layer normalization parameters
- `fc1_w`, `fc1_b`: First linear projection
- `fc2_w`, `fc2_b`: Second linear projection
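
Written out as an illustrative numpy sketch (weight shapes assumed compatible; this is not the clip.cpp code, which builds the equivalent ggml graph):

```python
# Illustrative numpy version of the merger: LayerNorm -> Linear -> GELU -> Linear.
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def deepstack_merger(x, norm_w, norm_b, fc1_w, fc1_b, fc2_w, fc2_b, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    h = (x - mu) / np.sqrt(var + eps) * norm_w + norm_b   # LayerNorm(norm_w, norm_b)
    h = gelu(h @ fc1_w + fc1_b)                           # Linear(fc1_w, fc1_b) + GELU
    return h @ fc2_w + fc2_b                              # Linear(fc2_w, fc2_b)
```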

**Spatial Operations**
- Fixed spatial merge for vision patch sequences
- Proper handling of patch grid dimensions
- Vision-text boundary management

**Safety Improvements**
- Removed illegal zero-tensor initialization for text-only inputs
- Conditional fusion: only processes when vision embeddings exist
- Prevents memory access violations in text-only inference

### Platform Support (llama-model-loader.cpp)

**Windows File Handle Limit**
- Increased stdio limit to 2048 handles (from default ~512)
- Critical for MoE models with many expert weight files
- Uses `_setmaxstdio()` on Windows platform
- Prevents "too many open files" errors during loading

### Reference Patches (llama/patches/)

Included for transparency and reproducibility:
- `0033-qwen3vl-base-architecture.patch`
- `0034-qwen3vl-deepstack-implementation.patch`
- `0035-qwen3vl-memory-fix.patch`
- `0036-qwen3vl-layer-norm-bias.patch`

## Technical Specifications

### Qwen3-VL (Dense)
- **Type**: Standard transformer with integrated vision encoder
- **Layers**: 36 (4B parameter model)
- **Embedding**: Full n_embd dimension
- **Position Encoding**: M-RoPE with 4 dimensional sections
- **Use Cases**: General vision-language understanding

### Qwen3-VL-MoE (Mixture of Experts)
- **Type**: Sparse MoE with DeepStack fusion
- **Layers**: 48 (30B-A3B: 3B activated) or 94 (235B-A22B: 22B activated)
- **Embedding**: 4-channel architecture (n_embd/4 per channel)
- **Experts**: Multiple expert networks per layer with learned routing
- **Fusion**: 3-layer early fusion (layers 0, 1, 2)
- **Use Cases**: High-quality vision understanding with improved efficiency

### DeepStack Fusion Mechanism

The multi-channel fusion enables:
1. **Parallel Processing**: Different aspects of vision processed independently
2. **Early Integration**: Information merged in early transformer layers
3. **Adaptive Routing**: Expert selection per channel and token
4. **Efficiency**: Sparse activation patterns reduce computation

## Capabilities Enabled

This implementation supports:
- **Multimodal Chat**: Conversational AI with image understanding
- **Image Captioning**: Detailed image descriptions
- **Visual Question Answering**: Answer questions about image content
- **Optical Character Recognition**: Extract text from images
- **Document Understanding**: Analyze documents, tables, charts
- **Image Analysis**: Detailed visual scene understanding

## References and Acknowledgments

This implementation is based on the outstanding work by the community:

**Primary Source Repository**
- Branch: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
- Author: LETS-BEE

**Source Commits** (applied in llama/patches/):
1. Base Architecture
   LETS-BEE/llama.cpp@9971912

2. DeepStack Implementation
   LETS-BEE/llama.cpp@b913e89

3. Memory Access Fix
   LETS-BEE/llama.cpp@de0e3d3

4. Layer Normalization Update
   LETS-BEE/llama.cpp@e45aecb

**Related Discussions and Pull Requests**
- Upstream llama.cpp Discussion:
  ggml-org/llama.cpp#16207 (comment)

- Upstream llama.cpp PR:
  ggml-org/llama.cpp#16745

- Related Ollama PR:
  ollama#12665

**Additional Context**
- OCR-related discussion:
  ggml-org/llama.cpp#16764

## Testing

Tested with:
- Qwen3-VL 4B parameter models (dense)
- Qwen3-VL-MoE 30B-A3B models (MoE)
- Various image understanding tasks
- Text-only and multimodal inference modes

## Future Work

Potential enhancements:
- Additional model size variants
- Performance optimizations for DeepStack fusion
- Extended M-RoPE configuration options
- Enhanced vision preprocessing pipelines

---

Special thanks to the llama.cpp community and all contributors who made
this multimodal vision-language support possible.
@ngxson ngxson merged commit c55d53a into ggml-org:master Oct 27, 2025
64 of 67 checks passed
@amritsingh183

Awesome !!
